[Hybrid Allocator] Support KV cache groups with different block_size #24949
base: main
Conversation
f"num_heads ({num_heads}) is not " \ | ||
f"divisible by num_kv_heads ({num_kv_heads})" | ||
|
||
# TODO in this PR: only for testing now. remove this hardcode later |
self reminder: remove this
for kv_cache_config in kv_cache_configs:
    kv_cache_config.num_blocks = min_num_blocks
# TODO: remove this print
print("kv_cache_configs", kv_cache_configs[0])
self reminder: remove this
attn_layers = get_layers_from_vllm_config(self.vllm_config, Attention)

# TODO in this PR: revert this
def get_torch_dtype(kv_cache_dtype: str) -> torch.dtype:
self reminder: remove this and do it in a future pr
Wait, this PR only supports bf16 for full attention and fp8 for sliding window. Trying to fix fp8 for full attention and bf16 for sliding window.
block_size=32)),
],
)
Would you mind adding a test for the mixed dtype case?
# Different dtype, align by using different block size
kv_cache_specs_hybrid = {
    'layer_1': new_kv_cache_spec(dtype=torch.float8_e4m3fn),
    'layer_2': new_sliding_window_spec(dtype=torch.bfloat16),
}
kv_cache_config_hybrid = get_kv_cache_configs(
    vllm_config, [kv_cache_specs_hybrid],
    [mem_per_block_per_layer * 32])[0]
assert kv_cache_config_hybrid == KVCacheConfig(
    num_blocks=32 * 2,  # 2x blocks because baseline is BF16 (not FP32)
    kv_cache_tensors=[
        KVCacheTensor(size=mem_per_block_per_layer * 32,
                      shared_by=["layer_1", "layer_2"]),
    ],
    kv_cache_groups=[
        KVCacheGroupSpec(["layer_1"],
                         new_kv_cache_spec(dtype=torch.float8_e4m3fn,
                                           block_size=32)),
        KVCacheGroupSpec(["layer_2"],
                         new_sliding_window_spec(dtype=torch.bfloat16,
                                                 block_size=16)),
    ],
)
Similarly, this could use new_kv_cache_spec, as there is nothing specific to new_sliding_window_spec, I'd say.
Would you mind adding a test for the mixed dtype case?
I think there is no difference between mixed dtype and mixed head size from the view of this PR. Feel free to add tests when you are working on mixed dtype support.
Similarly, this could use new_kv_cache_spec, as there is nothing specific to new_sliding_window_spec, I'd say.
For models with only full attention, we can have a much simpler path because we don't need to ensure all layers have the same page_size_bytes. I'm working on it in another PR.
This pull request has merge conflicts that must be resolved before it can be merged.
Edit: I just remembered that DCP doesn't currently support FP8 KV cache, so it seems likely that that's the issue here?

Unsure if this is expected, since IIUC this PR is not yet finished, but I get a crash during startup at:

git clone --branch two_dtype_kv_cache https://github.com/heheda12345/vllm && cd vllm && git reset --hard aaf8bc9366fa270dc0b5eea81dec3a01206bd6ef
VLLM_USE_PRECOMPILED=1 uv pip install --editable .[flashinfer]
vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 4 -dcp 4 --served-model-name default --max-model-len 9216 --kv-cache-dtype fp8_e4m3

It works fine without [...]. I've only tested on a 4xH200 machine.
Thanks for catching this! I didn't try DCP yet. But why do you need this PR for DeepSeek-R1?
Hi @heheda12345, thanks for your comment. I was actually just testing this PR in case it solves a weird bug with DCP inference, seemingly related to incorrect KV cache storage/retrieval, which causes some requests to use the wrong KV data during inference when prefix caching is enabled. From your question, it sounds like this PR is not related to DeepSeek R1/V3. I was too inexperienced with this stuff to realise that ^^'
Will rebase it next week to avoid the conflict with #25101
Will handle the DCP-related crash after #26296
Purpose
The hybrid allocator currently requires all layers to have the same physical memory per block. But models like #24916 (bf16 for sliding window attention and fp8 for full attention) have different memory per block for different layers.
This PR supports these cases by giving different layers different block_sizes so that the physical memory per block is the same for all layers. For now, one layer's memory per block must be a multiple of the other's.
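To make the idea concrete, here is a minimal sketch (the helper unify_block_sizes and the byte counts are hypothetical, for illustration only, and not the PR's actual code): the layer with the largest per-token KV footprint keeps the base block_size, and layers with a smaller footprint get a proportionally larger block_size so every layer ends up with the same page_size_bytes.

# Illustrative sketch only, not the PR's implementation: pick per-layer
# block_sizes so that every layer has the same physical memory per block.
def unify_block_sizes(bytes_per_token: dict[str, int],
                      base_block_size: int = 16) -> dict[str, int]:
    # The layer with the largest per-token footprint keeps the base
    # block_size; its page size becomes the common target.
    target_page_size = max(bytes_per_token.values()) * base_block_size
    block_sizes = {}
    for layer, nbytes in bytes_per_token.items():
        # The PR currently requires one layer's memory per block to be a
        # multiple of the other's, so this division must be exact.
        assert target_page_size % nbytes == 0
        block_sizes[layer] = target_page_size // nbytes
    return block_sizes

# Example with made-up byte counts: a bf16 sliding-window layer needs 2x the
# KV bytes per token of an fp8 full-attention layer with the same shape.
print(unify_block_sizes({"sliding_window_bf16": 1024, "full_attn_fp8": 512}))
# -> {'sliding_window_bf16': 16, 'full_attn_fp8': 32}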
To support prefix caching, we need to:
1. Update get_longest_cache_hit and set the alignment requirement to the LCM of all block_sizes.
2. Generate the block hashes with block_size=cache_config.block_size.

For 2, we can generate the block hashes with a larger block_size from those with a smaller block_size. For example, from the block hashes with block_size 16, we can get the block hashes with block_size 32 by concatenating two hash values with block_size 16 into one hash value with block_size 32:
block_hash with block_size 16:
block_hash with block_size 32:
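As a rough illustration of that merging step (merge_block_hashes and the sha256-based hashing below are stand-ins, not vLLM's actual block hashing), assuming the larger block_size is an integer multiple of the smaller one:

# Illustrative sketch only: derive block hashes for a larger block_size from
# the hashes computed at a smaller block_size.
from hashlib import sha256

def merge_block_hashes(small_hashes: list[bytes], ratio: int) -> list[bytes]:
    # Each larger-block hash covers `ratio` consecutive smaller blocks; a
    # trailing partial group is dropped, since only full blocks are hashed.
    merged = []
    for i in range(0, len(small_hashes) - ratio + 1, ratio):
        merged.append(sha256(b"".join(small_hashes[i:i + ratio])).digest())
    return merged

# Block hashes for block_size 16 -> block hashes for block_size 32 (ratio 2).
h16 = [sha256(f"block-{i}".encode()).digest() for i in range(5)]
h32 = merge_block_hashes(h16, ratio=2)  # 2 merged hashes; the 5th is dropped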
Note: for non-hybrid models with different hidden sizes per layer, like #22432, we may still keep the block size the same for all layers. I plan to do that in a future PR.
Test Plan
Set the kv_dtype of either sliding window attention or full attention to fp8 and run
And also run the necessary unit tests.
Test Result
Success
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.